AllLife Bank is a US bank that has a growing customer base. The majority of these customers are liability customers (depositors) with varying sizes of deposits. The number of customers who are also borrowers (asset customers) is quite small, and the bank is interested in expanding this base rapidly to bring in more loan business and in the process, earn more through the interest on loans. In particular, the management wants to explore ways of converting its liability customers to personal loan customers (while retaining them as depositors).
A campaign that the bank ran last year for liability customers showed a healthy conversion rate of over 9% success. This has encouraged the retail marketing department to devise campaigns with better target marketing to increase the success ratio.
You as a Data scientist at AllLife bank have to build a model that will help the marketing department to identify the potential customers who have a higher probability of purchasing the loan.
The objectives are to predict whether a liability customer will buy a personal loan, to understand which customer attributes are most significant in driving purchases, and to identify which segments of customers to target more.
ID: Customer ID
Age: Customer's age in completed years
Experience: # years of professional experience
Income: Annual income of the customer (in thousand dollars)
ZIPCode: Home address ZIP code
Family: Family size of the customer
CCAvg: Average spending on credit cards per month (in thousand dollars)
Education: Education level. 1: Undergrad; 2: Graduate; 3: Advanced/Professional
Mortgage: Value of house mortgage, if any (in thousand dollars)
Personal_Loan: Did this customer accept the personal loan offered in the last campaign? (0: No, 1: Yes)
Securities_Account: Does the customer have a securities account with the bank? (0: No, 1: Yes)
CD_Account: Does the customer have a certificate of deposit (CD) account with the bank? (0: No, 1: Yes)
Online: Does the customer use internet banking facilities? (0: No, 1: Yes)
CreditCard: Does the customer use a credit card issued by any other bank (excluding AllLife Bank)? (0: No, 1: Yes)
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import scipy.stats as stats
from sklearn import metrics, tree
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.metrics import (confusion_matrix, classification_report, accuracy_score, precision_score, recall_score, f1_score)
%matplotlib inline
from google.colab import drive
# Mount Google Drive
drive.mount('/content/drive')
data = pd.read_csv('/content/drive/MyDrive/Loan_Modelling.csv')
df = data.copy()
print(f"There are {df.shape[0]} rows and {df.shape[1]} columns in this dataset.")
df.columns
df.head()
df.dtypes
df.isnull().sum()
df.nunique()
df.drop(['ID'], axis=1, inplace=True)
df.head()
# Using pandas get_dummies for one-hot encoding
cat_features = ['Family', 'Education']
df_encoded = pd.get_dummies(df, columns=cat_features)
df_encoded.head()
df.info()
Questions:
# Histogram for Mortgage distribution
plt.figure(figsize=(5, 5))
sns.histplot(df['Mortgage'], kde=True, bins=50)
plt.title('Distribution of Mortgage')
plt.xlabel('Mortgage Value')
plt.ylabel('Frequency')
plt.show()
# Boxplot to check for outliers in Mortgage
plt.figure(figsize=(5, 5))
sns.boxplot(x=df['Mortgage'])
plt.title('Boxplot of Mortgage')
plt.xlabel('Mortgage Value')
plt.show()
Distribution of Mortgage:
The histogram indicates a right-skewed distribution for the 'Mortgage' attribute: most customers have low or no mortgage, while a much smaller number carry higher mortgages. In addition, there are a significant number of outliers, indicating that some individuals have substantially higher mortgages.
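A standard fix for this kind of heavy right skew, should a scale-sensitive model be used later, is a log transform. A minimal sketch of the idea, using synthetic stand-in data (the mixture of zeros and lognormal values below is illustrative, not the real Loan_Modelling.csv column):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
# ~70% zeros plus a lognormal tail, mimicking a mortgage-like distribution
demo = pd.DataFrame({"Mortgage": np.where(rng.random(1000) < 0.7, 0.0,
                                          rng.lognormal(5, 0.5, 1000))})
# log1p maps 0 -> 0, so customers without a mortgage stay at zero
demo["Mortgage_log"] = np.log1p(demo["Mortgage"])

skew_before = demo["Mortgage"].skew()
skew_after = demo["Mortgage_log"].skew()
```

Tree-based models (used later in this notebook) are insensitive to monotone transforms, so this matters mostly if linear or distance-based models are tried.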
df['CreditCard'].value_counts()
Number of Customers with Credit Cards:
There are 1,470 customers who hold a credit card issued by another bank.
# Correlation with Personal Loan
correlation = df.corr()['Personal_Loan'].sort_values(ascending=False)
print(correlation)
Attributes Strongly Correlated with Personal Loan:
Attributes closely linked to the acceptance of a personal loan include 'Income', 'CCAvg' (average spending on credit cards), 'CD_Account' (ownership of a certificate of deposit account), 'Mortgage', and 'Education'. These factors are significant in predicting the probability of a customer opting for a personal loan.
# Count plot for Loan Interest vs Age
plt.figure(figsize=(15, 6))
sns.countplot(x='Age', hue='Personal_Loan', data=df)
plt.title('Loan Interest vs Age')
plt.xlabel('Age')
plt.ylabel('Count')
plt.xticks(rotation=90)
plt.show()
Loan Interest vs Age:
Although no strong trend emerges, the count plot for 'Age' shows that interest in personal loans is spread across many different ages.
# Count plot for Loan Interest vs Education
plt.figure(figsize=(10, 6))
sns.countplot(data=df, x='Education', hue='Personal_Loan')
plt.title('Loan Interest vs Education')
plt.xlabel('Education')
plt.ylabel('Count');
Loan Interest v. Education:
The count plot of 'Education' suggests that customer interest in personal loans varies according to their educational backgrounds. It seems that individuals with advanced educational qualifications may demonstrate a marginally greater inclination towards personal loans.
df.describe().T
# Converted negative values in the 'Experience' column to absolute values
df['Experience'] = df['Experience'].abs()
#creating smaller dataframe
columns = df[['Experience', 'Age', 'Income', 'CCAvg', 'Mortgage']]
# Creating the subplot grid
fig, axes = plt.subplots(nrows=3, ncols=2, figsize=(10, 10))
axes = axes.flatten()
for i, col in enumerate(columns):
    sns.histplot(df[col], bins=10, ax=axes[i], alpha=0.5, edgecolor='black', color='skyblue', kde=True)
    # Calculating mean and median
    mean_value = df[col].mean()
    median_value = df[col].median()
    # Adding lines for mean and median
    axes[i].axvline(mean_value, color='red', linestyle='--', linewidth=2, label=f'Mean: {mean_value:.2f}')
    axes[i].axvline(median_value, color='blue', linestyle='-', linewidth=2, label=f'Median: {median_value:.2f}')
    # Adding a legend
    axes[i].legend()
plt.tight_layout() # Adjusts the plots to fit into the figure area.
plt.show()
The distributions of Experience and Age appear very similar. Income is right-skewed, as expected, with the median sitting around $64k/yr. CCAvg (average monthly credit-card spending) is also right-skewed, which makes sense given that most customers' monthly card spending is modest.
#See how many 0's are present in the mortgage
print("Number of customers with a mortgage of $0 is", df[df.Mortgage == 0].shape[0])
print("Or, approximately", round(df[df.Mortgage == 0].shape[0] / df.shape[0] * 100, 2), "%")
A very large share of customers (about 69.24%) do not have a mortgage. Interesting!
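With roughly two-thirds of customers at exactly zero, it can help to separate "has a mortgage at all" from "how large the mortgage is" with a binary flag. A small hedged sketch on toy data (the `Has_Mortgage` column name is an illustrative choice, not part of the original dataset):

```python
import pandas as pd

# Toy stand-in for the Mortgage column
demo = pd.DataFrame({"Mortgage": [0, 0, 150, 0, 90, 0, 0, 220]})

# Binary indicator: 1 if the customer carries any mortgage, else 0
demo["Has_Mortgage"] = (demo["Mortgage"] > 0).astype(int)
zero_share = (demo["Mortgage"] == 0).mean()
```

A feature like this often carries most of the signal when a numeric column is dominated by zeros.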
plt.figure(figsize=(12, 8))
sns.heatmap(df.corr(), annot=True, fmt=".1f")
plt.show()
Age and Experience are nearly congruent; we could drop one if necessary. Income has a high correlation with Personal_Loan, and CCAvg a moderately high one. CD_Account also produces some notable findings, though it is hard to tell whether this is promising or just a coincidence.
Additionally, Income and CCAvg are correlated with each other, which is interesting.
top_zip_codes = df['ZIPCode'].value_counts().nlargest(20)
sns.barplot(x=top_zip_codes.index, y=top_zip_codes.values)
plt.title('Top 20 ZIP Codes by Number of Customers')
plt.xlabel('ZIP Code')
plt.xticks(rotation=45)
plt.ylabel('Number of Customers')
plt.show()
!pip install uszipcode
!pip install zipcodes
from uszipcode import SearchEngine
search = SearchEngine()
import zipcodes
# Function to get state from a ZIP code
def get_state_from_zip(zipcode):
    # Retrieve the zipcode object
    zipcode_info_1 = search.by_zipcode(zipcode)
    # Return the state attribute (np.nan keeps missing values genuinely missing)
    return zipcode_info_1.state if zipcode_info_1 else np.nan

# Function to get city from a ZIP code
def get_city_from_zip(zipcode):
    # Retrieve the zipcode object
    zipcode_info_2 = search.by_zipcode(zipcode)
    # Return the city attribute
    return zipcode_info_2.city if zipcode_info_2 else np.nan
# Apply the function to each row in the 'ZIPCode' column
df['State'] = df['ZIPCode'].apply(get_state_from_zip)
df['City'] = df['ZIPCode'].apply(get_city_from_zip)
print(df['State'].unique())
print(df['City'].nunique())
Understanding ZIP codes in more detail is important. The findings indicate that the data is concentrated in California, spread across approximately 245 cities.
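Calling the lookup once per row is slow when many rows share a ZIP. A common speed-up is to resolve each unique ZIP once and then broadcast with `map()`. A sketch of the pattern with a hypothetical stub standing in for `search.by_zipcode` (the dict inside `lookup_state` is made-up demo data):

```python
import pandas as pd

def lookup_state(zipcode):
    # Hypothetical stand-in for the (comparatively slow) uszipcode lookup
    return {"90210": "CA", "94704": "CA", "10001": "NY"}.get(str(zipcode))

demo = pd.DataFrame({"ZIPCode": ["90210", "94704", "90210", "10001", "90210"]})

# Resolve each unique ZIP exactly once, then broadcast the result
state_by_zip = {z: lookup_state(z) for z in demo["ZIPCode"].unique()}
demo["State"] = demo["ZIPCode"].map(state_by_zip)
```

With 5,000 rows but far fewer unique ZIPs, this cuts the number of lookups substantially.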
# Identify the top 10 cities by frequency
top_cities = df['City'].value_counts().head(10)
# Plotting
top_cities.plot(kind='bar')
plt.title('Top 10 Cities by Frequency')
plt.xlabel('City')
plt.ylabel('Count')
plt.xticks(rotation=45)
plt.show()
LA has the highest count, followed by San Diego, San Francisco, and Berkeley. This makes sense, as these are affluent areas with large populations.
df_cat=df[['CreditCard','Personal_Loan','Family','Education','Securities_Account','CD_Account','Online']]
for column in df_cat.columns:
    print(f"Counts for {column}:")
    print(df_cat[column].value_counts())
Credit Card Ownership: A total of 5,000 individuals are represented, with 3,530 not owning a credit card (70.6%) and 1,470 owning a credit card (29.4%). Interest in Personal Loans: A small fraction, 480 out of 5,000 (9.6%), have shown interest in personal loans.
Family Size: The distribution of family sizes is relatively uniform, with the smallest family size (1 member) being the most common at 1,472 individuals (29.44%), followed by two-member families (1,296 or 25.92%), four-member families (1,222 or 24.44%), and three-member families being the least common (1,010 or 20.2%).
Education Level: Education level 1 (presumably the lowest level of education) is the most common, with 2,096 individuals (41.92%), followed by level 3 with 1,501 individuals (30.02%), and level 2 with 1,403 individuals (28.06%).
Securities Account Ownership: The majority, 4,478 (89.56%), do not have a securities account, while 522 (10.44%) do, indicating a low prevalence of securities account ownership.
CD Account Ownership: Similar to securities accounts, a small portion, 302 out of 5,000 (6.04%), own a CD account.
Online Banking Usage: A majority, 2,984 (59.68%), use online banking services, suggesting a preference or inclination towards the convenience of online banking among more than half of the individuals.
Overall, the data highlights several key trends: a general reluctance or lack of necessity for personal loans and securities accounts, a slight preference toward not holding another bank's credit card, and a majority preference for online banking services. These factors may be important later on when evaluating the big picture for the target audience.
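The percentages quoted above come straight from normalized value counts. A minimal sketch of the computation on a toy column (the 47/3 split is illustrative, not the real data):

```python
import pandas as pd

# Toy stand-in: 47 customers without a CD account, 3 with one
demo = pd.DataFrame({"CD_Account": [0] * 47 + [1] * 3})

# normalize=True yields proportions; scale to percentages and round
shares = demo["CD_Account"].value_counts(normalize=True).mul(100).round(2)
```

Applying the same one-liner per column of `df_cat` reproduces all the percentage figures in the commentary.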
df_box=df[['Experience','Age','Income','CCAvg','Mortgage']]
# Creating a figure and a set of subplots
fig, axes = plt.subplots(nrows=3, ncols=2, figsize=(10, 10))
# Looping through the columns and creating box plots
for i, column in enumerate(df_box):
    row_index = i // 2    # integer division for the row index
    column_index = i % 2  # remainder for the column index
    sns.boxplot(y=df[column], ax=axes[row_index, column_index], color='blue', capprops=dict(color='blue'), showmeans=True)
    axes[row_index, column_index].set_title(f'{column} Distribution')
Another view of the distributions of some of the variables and their prominent skew.
# Selecting a subset of variables for clarity
selected_columns = ['Age', 'Experience', 'Income', 'CCAvg', 'Mortgage', 'Personal_Loan']
sns.pairplot(df[selected_columns], hue='Personal_Loan')
plt.show()
Looking at the correlations in a different way: the pairplot shows the spread via scatterplots, a broad-stroke approach to understanding the distributions. Age and Experience, again, appear very well correlated. Income does appear to be a determining factor, displaying a sort of splitting point between those who accepted loans and those who did not. Something to consider for later.
plt.figure(figsize=(10, 6))
sns.scatterplot(x='Income', y='Mortgage',hue='Personal_Loan', data=df)
plt.title('Mortgage vs. Income')
plt.xlabel('Income')
plt.ylabel('Mortgage')
plt.show()
A closer look at Income and Mortgage. The earlier comments seem to hold, with income acting as something of a deciding factor for mortgages.
sns.boxplot(x='Education', y='Income', data=df)
plt.title('Income Distribution by Education Level')
plt.xlabel('Education Level')
plt.ylabel('Income')
plt.show()
Interestingly, income appears higher and more widely spread among less-educated individuals. However, this may be influenced by the sheer volume of data and its more even distribution. There are a fair number of outliers in the higher-education groups.
sns.lineplot(x='Age', y='Experience', data=df)
plt.title('Age vs Experience')
plt.xlabel('Age')
plt.ylabel('Experience')
plt.show()
# Preparing the data
Edu_CC = df.groupby('Education')['CreditCard'].value_counts(normalize=True).unstack()
# Plotting
Edu_CC.plot(kind='bar', stacked=True)
plt.title('Credit Card Ownership by Education Level')
plt.xlabel('Education Level')
plt.ylabel('Proportion')
plt.show()
Education and credit-card ownership together do not appear to be a useful indicator.
df.groupby('Personal_Loan')['Income'].describe()
Interestingly, incomes are much higher for customers with personal loans (both mean and median).
df_Loan_1 = df[df['Personal_Loan'] == 1]
df_Loan_0 = df[df['Personal_Loan'] == 0]
# Plotting scatter plots for Income of customers with and without personal loans
plt.figure(figsize=(12, 6))
# Scatter plot for customers with personal loans
sns.scatterplot(data=df_Loan_1, x=df_Loan_1.index, y='Income', color='blue', label='With Personal Loan')
# Scatter plot for customers without personal loans
sns.scatterplot(data=df_Loan_0, x=df_Loan_0.index, y='Income', color='red', label='Without Personal Loan')
plt.title('Income of Customers With and Without Personal Loans')
plt.xlabel('Customer Index')
plt.ylabel('Income')
plt.legend()
plt.show()
Again, a similar conclusion can be made, with high income being an indicator of personal loans.
# Creating values for the average (and median) income in each group
print('Mean income of customers with a loan', df_Loan_1['Income'].mean())
print('Median income of customers with a loan',df_Loan_1['Income'].median())
print('Mean income of customers without a loan',df_Loan_0['Income'].mean())
print('Median income of customers without a loan',df_Loan_0['Income'].median())
# Creating a histogram on the same chart to show the distribution of Income
plt.figure(figsize=(12, 6))
# Histogram for customers with personal loans
sns.histplot(df_Loan_1['Income'], color='blue', label='With Personal Loan', alpha=0.6, bins=30, kde=True)
# Histogram for customers without personal loans
sns.histplot(df_Loan_0['Income'], color='red', label='Without Personal Loan', alpha=0.6, bins=30, kde=True)
plt.title('Income Distribution of Customers With and Without Personal Loans')
plt.xlabel('Income')
plt.ylabel('Frequency')
plt.legend()
plt.show()
Another angle on the distribution, attempting to understand how the skew plays in. Note that the personal-loan population appears to have a more symmetric distribution with less skew.
# Histogram for customers with personal loans, normalized
sns.histplot(df_Loan_1['Income'], color='blue', label='With Personal Loan', alpha=0.6, bins=15, kde=True, stat="density")
# Histogram for customers without personal loans, normalized
sns.histplot(df_Loan_0['Income'], color='red', label='Without Personal Loan', alpha=0.6, bins=15, kde=True, stat="density")
plt.title('Normalized Income Distribution of Customers With and Without Personal Loans')
plt.xlabel('Income')
plt.ylabel('Density')
plt.legend()
plt.show()
By creating a density plot, we can see the appearance of a 'breaking' point between the two... This may be interesting when looking at incorporating new customers.
# Calculate various percentiles of income for both groups
percentiles = [50,55, 60, 65, 70, 75, 80, 85, 90, 95] # You can adjust these percentiles as needed
income_percentiles_loan = df_Loan_1['Income'].quantile(q=[p/100 for p in percentiles])
income_percentiles_no_loan = df_Loan_0['Income'].quantile(q=[p/100 for p in percentiles])
#displaying table:
print("Income Percentiles for Customers with Personal Loan:")
print(income_percentiles_loan)
print("\nIncome Percentiles for Customers without Personal Loan:")
print(income_percentiles_no_loan)
df_percentiles = pd.DataFrame({
'Percentile': percentiles,
'Income_with_Loan': income_percentiles_loan,
'Income_without_Loan': income_percentiles_no_loan
})
# Plotting the data
plt.figure(figsize=(12, 6))
plt.plot('Percentile', 'Income_with_Loan', data=df_percentiles, marker='o', color='blue', label='With Personal Loan')
plt.plot('Percentile', 'Income_without_Loan', data=df_percentiles, marker='o', color='red', label='Without Personal Loan')
plt.title('Income Percentiles for Customers With and Without Personal Loans')
plt.xlabel('Percentile')
plt.ylabel('Income')
plt.xticks(percentiles)
plt.legend()
plt.show()
The trend of income vs. percentile is a good visual of where the general cutoff, and the opportunity, lies among those with income high enough to take on a loan.
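The "cutoff" intuition above can be made concrete by scanning candidate income thresholds and scoring each as a one-feature classifier. A hedged sketch on synthetic data (the normal mixtures below only mimic the loan/no-loan income split, they are not the real figures):

```python
import numpy as np

rng = np.random.default_rng(2)
# Synthetic incomes: 900 non-acceptors around 65k, 100 acceptors around 145k
income = np.concatenate([rng.normal(65, 20, 900), rng.normal(145, 30, 100)])
loan = np.concatenate([np.zeros(900, int), np.ones(100, int)])

best_cut, best_f1 = None, -1.0
for cut in range(40, 200, 5):
    pred = (income >= cut).astype(int)  # "predict loan if income >= cut"
    tp = ((pred == 1) & (loan == 1)).sum()
    fp = ((pred == 1) & (loan == 0)).sum()
    fn = ((pred == 0) & (loan == 1)).sum()
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    if f1 > best_f1:
        best_cut, best_f1 = cut, f1
```

This is essentially what the decision tree does at its root split, so the threshold it learns later should land in a similar region.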
# Breaking down the average CC-Avg in each group with and without a loan
print('Mean CC-Avg of customers with a loan', df_Loan_1['CCAvg'].mean())
print('Median CC-Avg of customers with a loan',df_Loan_1['CCAvg'].median())
print('Mean CC-Avg of customers without a loan',df_Loan_0['CCAvg'].mean())
print('Median CC-Avg of customers without a loan',df_Loan_0['CCAvg'].median())
Customers with loans have a markedly higher average monthly credit-card spend (CCAvg), which seems to indicate these individuals are willing to take on more debt.
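The mean/median gap between the two groups can also be checked for statistical significance. A sketch using a Mann-Whitney U test (which, unlike a t-test, does not assume normality and so suits skewed spending data); the arrays below are synthetic stand-ins for `df_Loan_1['CCAvg']` and `df_Loan_0['CCAvg']`, with assumed group sizes and means:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
# Stand-in samples: 480 loan acceptors, 4520 non-acceptors (assumed parameters)
ccavg_loan = rng.normal(3.9, 1.0, 480)
ccavg_no_loan = rng.normal(1.7, 1.0, 4520)

# One-sided test: is spending in the loan group stochastically greater?
stat, p_value = stats.mannwhitneyu(ccavg_loan, ccavg_no_loan, alternative="greater")
```

A very small p-value would indicate the spending difference is not a sampling artifact; on the real columns this is a one-line swap.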
ccavg_percentiles_loan = df_Loan_1['CCAvg'].quantile(q=[p/100 for p in percentiles])
ccavg_percentiles_no_loan = df_Loan_0['CCAvg'].quantile(q=[p/100 for p in percentiles])
#updating the DataFrame for the CCAvg percentiles
df_percentiles['CCAvg_with_Loan'] = ccavg_percentiles_loan
df_percentiles['CCAvg_without_Loan'] = ccavg_percentiles_no_loan
#displaying table:
print("CC-Avg Percentiles for Customers with Personal Loan:")
print(ccavg_percentiles_loan)
print("\nCC-Avg Percentiles for Customers without Personal Loan:")
print(ccavg_percentiles_no_loan)
A more numerical representation of the correlation evident between the split groups, those with loans and those without.
# Creating histograms for income distribution
plt.figure(figsize=(12, 6))
sns.histplot(df_Loan_1['CCAvg'], color='blue', label='With Personal Loan', alpha=0.6, bins=30, kde=True, stat="density")
sns.histplot(df_Loan_0['CCAvg'], color='red', label='Without Personal Loan', alpha=0.6, bins=30, kde=True, stat="density")
plt.title('Normalized CCAvg Distribution for Customers With and Without Personal Loans')
plt.xlabel('CCAvg')
plt.ylabel('Density')
plt.legend()
plt.show()
# Plotting the data
plt.figure(figsize=(12, 6))
plt.plot('Percentile', 'CCAvg_with_Loan', data=df_percentiles, marker='o', color='blue', label='With Personal Loan')
plt.plot('Percentile', 'CCAvg_without_Loan', data=df_percentiles, marker='o', color='red', label='Without Personal Loan')
plt.title('CCAvg Percentiles for Customers With and Without Personal Loans')
plt.xlabel('Percentile')
plt.ylabel('CCAvg')
plt.xticks(percentiles)
plt.legend()
plt.show()
Further visualizations to help us understand how the data is distributed. There is certainly some difference compared to Income. This measure is perhaps more inclusive of the 'without loan' group, and may be a good indicator of reliability when combined with further information such as credit score!
df_percentiles.describe().T
# Making a copy of the DataFrame
df_Age = df.copy()
# Creating age groups and adding it as a new column
df_Age['Age_Group'] = pd.cut(df_Age['Age'], bins=[18, 24, 30, 36, 42, 50, 56, 62, 68], labels=['18-23', '24-29', '30-35', '36-41', '42-49', '50-55', '56-61', '62-68'])
# Grouping by Age_Group and counting the CreditCard usage
age_group_counts = df_Age.groupby('Age_Group')['CreditCard'].count()
# Print the counts
print(age_group_counts)
# Creating the bar plot with Seaborn
sns.barplot(x=age_group_counts.index, y=age_group_counts.values)
plt.title('Credit Card Count by Age Group')
plt.xlabel('Age Group')
plt.ylabel('Credit Card Count')
A roughly normal (bell-shaped) distribution is evident in the number of customers per age group.
# Calculating correlation matrix for the DataFrame
corr_matrix = df.corr()
# Sorting the 'Personal_Loan' column correlations
sorted_corr = corr_matrix['Personal_Loan'].sort_values(ascending=False)
print(sorted_corr)
A good visual of the correlations between the data attributes and personal loan in order of influence.
# Removing the added State and City columns from the original data
df.drop(['State','City'],axis=1,inplace=True)
df.describe()
# breaking apart education and family into 1's and 0's using dummy variables.
df_dummies = pd.get_dummies(df, columns=['Education', 'Family'], drop_first=True)
df_dummies.head()
# Rechecking for null values.
df_dummies.isnull().sum()
#Checking data types
df_dummies.dtypes
We will most likely not feed the dummy variables into the model directly, but we keep them as a backup for further data evaluation or model variations.
More emphasis should be placed on recall since our primary objective is to predict whether the customer will accept a personal loan or not. The bank aims to encourage more customers to accept personal loans, which means reducing the number of False Negatives to ensure that the bank doesn't miss out on genuine customers who wish to borrow. Therefore, our main focus should be on improving recall.
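The recall emphasis above also affects how a trained classifier is used: beyond choosing recall as the tuning metric, the decision threshold itself can be lowered so fewer likely borrowers are missed. A hedged sketch on a synthetic imbalanced problem (the 90/10 class weights mimic the ~9.6% loan-acceptance rate; none of this is the real data):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import recall_score
from sklearn.model_selection import train_test_split

# Imbalanced toy problem standing in for the loan data
X_demo, y_demo = make_classification(n_samples=2000, weights=[0.9, 0.1], random_state=1)
Xtr, Xte, ytr, yte = train_test_split(X_demo, y_demo, test_size=0.3,
                                      stratify=y_demo, random_state=1)
lr = LogisticRegression(max_iter=1000).fit(Xtr, ytr)

# Lowering the probability cutoff trades precision for recall
proba = lr.predict_proba(Xte)[:, 1]
recall_default = recall_score(yte, (proba >= 0.5).astype(int))
recall_lowered = recall_score(yte, (proba >= 0.3).astype(int))
```

Lowering the cutoff can only keep or increase recall (more customers are flagged), at the cost of contacting more customers who would have declined; for a marketing campaign that trade-off is usually acceptable.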
def model_performance_classification_with_confusion_matrix(model, predictors, target):
    # Predicting using the independent variables
    pred = model.predict(predictors)

    # Computing metrics
    accuracy = accuracy_score(target, pred)
    recall = recall_score(target, pred)
    precision = precision_score(target, pred)
    f1 = f1_score(target, pred)

    # Creating a DataFrame of metrics
    df_performance = pd.DataFrame({"Accuracy": [accuracy], "Recall": [recall], "Precision": [precision], "F1": [f1]})

    # Calculating the confusion matrix (normalized per true class for percentages)
    cm = confusion_matrix(target, pred)
    cm_normalized = cm.astype('float') / cm.sum(axis=1)[:, np.newaxis]
    labels = np.asarray([
        "{0:0.0f}\n{1:.2%}".format(value, percentage)
        for value, percentage in zip(cm.flatten(), cm_normalized.flatten())
    ]).reshape(2, 2)

    # Plotting the confusion matrix
    plt.figure(figsize=(8, 6))
    sns.heatmap(cm, annot=labels, fmt='', cmap='Blues', cbar=False, xticklabels=['No', 'Yes'], yticklabels=['No', 'Yes'])
    plt.ylabel('True Label')
    plt.xlabel('Predicted Label')
    plt.title('Confusion Matrix with Metrics')

    # Display performance metrics on the plot
    metrics_text = f'Accuracy: {accuracy:.1%}\nPrecision: {precision:.1%}\nRecall: {recall:.1%}\nF1 Score: {f1:.1%}'
    plt.text(2.5, 0.5, metrics_text, va='center', ha='center', bbox=dict(facecolor='white', edgecolor='black', boxstyle='round,pad=1'))

    return df_performance
## Define X and y variables
y = df['Personal_Loan']
X = df.drop('Personal_Loan', axis=1)
X
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.30, random_state=1)
# Initialize the Decision Tree Classifier with a random state for reproducibility
clf = DecisionTreeClassifier(random_state=1)
# Train the model on the training data
clf.fit(X_train, y_train)
# Evaluate on the test set; the helper predicts internally, so y_test must stay
# the true labels (it should never be overwritten with predictions)
performance_df = model_performance_classification_with_confusion_matrix(clf, X_test, y_test)
print(performance_df)
As expected, running a standard performance check shows an overfit model: the fully grown tree effectively memorizes the training data, simply counting the true yeses and true nos.
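Overfitting of an unconstrained tree is easy to demonstrate in isolation: on noisy data, training accuracy reaches 100% while test accuracy lags well behind. A minimal sketch on synthetic data (the `flip_y=0.2` label noise is an assumption chosen to make the gap visible):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic problem with 20% label noise to exaggerate memorization
X_demo, y_demo = make_classification(n_samples=1000, n_informative=5,
                                     flip_y=0.2, random_state=0)
Xtr, Xte, ytr, yte = train_test_split(X_demo, y_demo, test_size=0.3, random_state=0)

deep_tree = DecisionTreeClassifier(random_state=0).fit(Xtr, ytr)
train_acc = deep_tree.score(Xtr, ytr)  # memorizes training set, including noise
test_acc = deep_tree.score(Xte, yte)   # generalization suffers
gap = train_acc - test_acc
```

A large train-test gap like this is exactly what the pre-pruning (GridSearchCV) and post-pruning (cost-complexity) steps below are meant to close.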
column_names = list(X.columns)
feature_names = column_names
plt.figure(figsize=(15, 20))
out = tree.plot_tree(clf, feature_names=feature_names,filled=True,fontsize=9,node_ids=True,class_names=True,)
for o in out:
    arrow = o.arrow_patch
    if arrow is not None:
        arrow.set_edgecolor("black")
        arrow.set_linewidth(1)
plt.show()
importances = clf.feature_importances_
indices = np.argsort(importances)
plt.figure(figsize=(12, 12))
plt.title("Feature Importances")
plt.barh(range(len(indices)), importances[indices], color="violet", align="center")
plt.yticks(range(len(indices)), [feature_names[i] for i in indices])
plt.xlabel("Relative Importance")
plt.show()
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler
# Scaling the data, as logistic regression is affected by feature scale and outliers
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
model = LogisticRegression()
model.fit(X_train_scaled, y_train)
# Evaluating performance on the training set
performance_df_train_LR = model_performance_classification_with_confusion_matrix(model, X_train_scaled, y_train)
print("Training Set Performance:\n", performance_df_train_LR)
# Evaluating performance on the test set - Use the scaled test data here
performance_df_test_LR = model_performance_classification_with_confusion_matrix(model, X_test_scaled, y_test) # Corrected to use X_test_scaled
print("Test Set Performance:\n", performance_df_test_LR)
With LR, although accuracy is quite high, recall is low. The model is an improvement on the un-optimized tree, but not ideal.
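One common way to lift logistic regression's recall on imbalanced data like this is `class_weight="balanced"`, which reweights the minority class during fitting. A hedged sketch on a synthetic 90/10 problem (overlap and sizes are assumptions, not the real data):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import recall_score
from sklearn.model_selection import train_test_split

# Synthetic imbalanced, overlapping classes
X_demo, y_demo = make_classification(n_samples=2000, weights=[0.9, 0.1],
                                     class_sep=0.5, random_state=2)
Xtr, Xte, ytr, yte = train_test_split(X_demo, y_demo, test_size=0.3,
                                      stratify=y_demo, random_state=2)

plain = LogisticRegression(max_iter=1000).fit(Xtr, ytr)
balanced = LogisticRegression(max_iter=1000, class_weight="balanced").fit(Xtr, ytr)

recall_plain = recall_score(yte, plain.predict(Xte))
recall_balanced = recall_score(yte, balanced.predict(Xte))
```

The reweighting typically raises recall at some cost to precision, which matches the priority stated earlier for this problem.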
# Grid Search for the best model parameters
param_grid_1 = {
'max_depth': np.arange(3,21,3).tolist() + [None], # Adding None to allow for unlimited depth
'min_samples_split': np.arange(2, 21, 2).tolist(), # Ranges from 2 to 20, stepping by 2
'min_samples_leaf': np.arange(1, 11, 1).tolist(), # Ranges from 1 to 10, stepping by 1
'max_leaf_nodes': np.arange(10, 101, 10).tolist() + [None] # Ranges from 10 to 100, stepping by 10, and None
}
param_grid_2 = {
"max_depth": np.arange(3,21,3).tolist() + [None],
"criterion": ["entropy", "gini"],
"splitter": ["best", "random"],
"min_impurity_decrease": [0.000001, 0.00001, 0.0001],
}
clf_pre_pruned = DecisionTreeClassifier(random_state=1)
grid_search_1 = GridSearchCV(clf_pre_pruned, param_grid_1, cv=5, scoring='recall') # Focus on recall to catch as many positives as possible
grid_search_1 = grid_search_1.fit(X_train, y_train)
grid_search_2 = GridSearchCV(clf_pre_pruned, param_grid_2, cv=5, scoring='recall') # Focus on recall to catch as many positives as possible
grid_search_2 = grid_search_2.fit(X_train, y_train)
print("Best parameters found by grid search:", grid_search_1.best_params_)
print("Best parameters found by grid search:", grid_search_2.best_params_)
estimator_1 = grid_search_1.best_estimator_
estimator_2 = grid_search_2.best_estimator_
# Fit the best algorithm to the data.
estimator_1 = estimator_1.fit(X_train, y_train)
estimator_2 = estimator_2.fit(X_train, y_train)
print(estimator_1)
print(estimator_2)
Although pre-pruned model 2 (grid search parameters 2) produces a relatively deep tree (depth 9) compared to pre-pruned model 1 (grid search parameters 1; depth 6), it may still be the best model. If the performance is close, though, the simpler pre-pruned model may be preferable.
column_names = list(X.columns)
feature_names = column_names
plt.figure(figsize=(15, 20))
out = tree.plot_tree(estimator_1, feature_names=feature_names,filled=True,fontsize=9,node_ids=True,class_names=True,)
for o in out:
    arrow = o.arrow_patch
    if arrow is not None:
        arrow.set_edgecolor("black")
        arrow.set_linewidth(1)
plt.show()
column_names = list(X.columns)
feature_names = column_names
plt.figure(figsize=(15, 20))
out = tree.plot_tree(estimator_2, feature_names=feature_names,filled=True,fontsize=9,node_ids=True,class_names=True,)
for o in out:
    arrow = o.arrow_patch
    if arrow is not None:
        arrow.set_edgecolor("black")
        arrow.set_linewidth(1)
plt.show()
Optimized_pre_prune_1_train = model_performance_classification_with_confusion_matrix(estimator_1, X_train, y_train)*100
Optimized_pre_prune_2_train = model_performance_classification_with_confusion_matrix(estimator_2, X_train, y_train)*100
print("Train Performance of Estimator 1:\n", Optimized_pre_prune_1_train,"\n")
print("Train Performance of Estimator 2:\n", Optimized_pre_prune_2_train)
Both models appear quite good; however, pre-pruning 2 better minimizes false negatives (true yeses predicted as no).
Optimized_pre_prune_1_test = model_performance_classification_with_confusion_matrix(estimator_1, X_test, y_test)*100
Optimized_pre_prune_2_test = model_performance_classification_with_confusion_matrix(estimator_2, X_test, y_test)*100
print("Test Performance of Estimator 1:\n", Optimized_pre_prune_1_test)
print("Test Performance of Estimator 2:\n", Optimized_pre_prune_2_test)
Test performance is very comparable to training performance for both pre-pruned grids, which suggests these models generalize well.
clf = DecisionTreeClassifier(random_state=1, class_weight="balanced")
path = clf.cost_complexity_pruning_path(X_train, y_train)
ccp_alphas, impurities = abs(path.ccp_alphas), path.impurities
clfs = []
for ccp_alpha in ccp_alphas:
    clf = DecisionTreeClassifier(random_state=1, ccp_alpha=ccp_alpha, class_weight="balanced")
    clf.fit(X_train, y_train)
    clfs.append(clf)
# Computing recall on the training data
recall_train = []
for clf in clfs:
    pred_train = clf.predict(X_train)
    recall_train.append(recall_score(y_train, pred_train))

# Computing recall on the test data
recall_test = []
for clf in clfs:
    pred_test = clf.predict(X_test)
    recall_test.append(recall_score(y_test, pred_test))
#scores
train_scores = [clf.score(X_train, y_train) for clf in clfs]
test_scores = [clf.score(X_test, y_test) for clf in clfs]
#plot
fig, ax = plt.subplots(figsize=(15, 5))
ax.set_xlabel("alpha")
ax.set_ylabel("Recall")
ax.set_title("Recall vs alpha for training and testing sets")
ax.plot(
ccp_alphas, recall_train, marker="o", label="train", drawstyle="steps-post",
)
ax.plot(ccp_alphas, recall_test, marker="o", label="test", drawstyle="steps-post")
ax.legend()
plt.show()
Both train and test recall are high at low alpha values; this is difficult to read from the plot alone, so the best model must be selected programmatically by alpha.
best_index = np.argmax(recall_test)
best_model = clfs[best_index]
print(best_model)
The ideal alpha appears to be ~0.29. We could also try a different class weight (or remove it), but we will decide whether that is necessary based on the performance of this 'best model'.
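A caveat with selecting purely on maximum test recall: a heavily pruned tree that predicts "yes" for everyone scores recall 1.0 while being useless. One hedged safeguard is to skip candidates whose precision collapses below a floor before taking the recall argmax. A sketch on synthetic data (the 0.5 precision floor and dataset parameters are illustrative assumptions):

```python
from sklearn.datasets import make_classification
from sklearn.metrics import precision_score, recall_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X_demo, y_demo = make_classification(n_samples=1500, weights=[0.9, 0.1],
                                     class_sep=1.5, random_state=3)
Xtr, Xte, ytr, yte = train_test_split(X_demo, y_demo, test_size=0.3,
                                      stratify=y_demo, random_state=3)
path = DecisionTreeClassifier(random_state=3).cost_complexity_pruning_path(Xtr, ytr)

best_alpha, best_recall = None, -1.0
for alpha in path.ccp_alphas:
    m = DecisionTreeClassifier(random_state=3, ccp_alpha=alpha).fit(Xtr, ytr)
    pred = m.predict(Xte)
    # Skip degenerate candidates: recall 1.0 is meaningless if precision collapses
    if precision_score(yte, pred, zero_division=0) < 0.5:
        continue
    r = recall_score(yte, pred)
    if r > best_recall:
        best_alpha, best_recall = alpha, r
```

Applying the same filter to the `clfs` list above would rule out the trivial all-yes trees before picking the alpha.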
Optimized_post_prune_train = model_performance_classification_with_confusion_matrix(best_model, X_train, y_train) * 100
Optimized_post_prune_test = model_performance_classification_with_confusion_matrix(best_model, X_test, y_test) * 100
print("Train Performance of Post-Pruned Model:\n", Optimized_post_prune_train)
print("Test Performance of Post-Pruned Model:\n", Optimized_post_prune_test)
Although recall is maximized, every customer has been predicted as 'yes'; the model is degenerate and provides no real discrimination.
plt.figure(figsize=(20, 10))
out = tree.plot_tree(
    best_model,
    feature_names=feature_names,
    filled=True,
    fontsize=9,
    node_ids=False,
    class_names=None,
)
for o in out:
    arrow = o.arrow_patch
    if arrow is not None:
        arrow.set_edgecolor("black")
        arrow.set_linewidth(1)
plt.show()
# training performance comparison
# Concatenate the DataFrames along columns (axis=1)
models_train_comp_df = pd.concat(
[
performance_df_train_LR.T,
performance_df.T,
Optimized_pre_prune_1_train.T,
Optimized_pre_prune_2_train.T,
Optimized_post_prune_train.T
],
axis=1,
)
models_train_comp_df.columns = [
"Logistic Regression Train",
"Decision Tree Unadjusted",
"Decision Tree (Pre-Pruning 1) Train",
"Decision Tree (Pre-Pruning 2) Train",
"Decision Tree (Post-Pruning) Train"
]
print("Train performance comparison:")
models_train_comp_df
# test performance comparison
# Concatenate the DataFrames along columns (axis=1)
models_test_comp_df = pd.concat(
[
performance_df_test_LR.T,
performance_df.T,
Optimized_pre_prune_1_test.T,
Optimized_pre_prune_2_test.T,
Optimized_post_prune_test.T
],
axis=1,
)
models_test_comp_df.columns = [
"Logistic Regression",
"Decision Tree Unadjusted",
"Decision Tree (Pre-Pruning 1) Test",
"Decision Tree (Pre-Pruning 2) Test",
"Decision Tree (Post-Pruning) Test"
]
print("Test performance comparison:")
models_test_comp_df
Based primarily on the recall scores on the training and test data, the Pre-Pruning 2 model is the best. It tunes the criterion, minimum impurity decrease, and splitter hyperparameters, which led to the most optimal model.
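A pre-pruned tree like the chosen model has a practical bonus for marketing: its splits can be exported as plain-text rules for the campaign team. A hedged sketch using `sklearn.tree.export_text` on synthetic data (the feature names are borrowed from this dataset for illustration only; the fitted splits below are not the real model's):

```python
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier, export_text

# Toy data; names are illustrative stand-ins for the real columns
X_demo, y_demo = make_classification(n_samples=500, n_features=4, random_state=4)
names = ["Income", "CCAvg", "Education", "Family"]

small_tree = DecisionTreeClassifier(max_depth=3, random_state=4).fit(X_demo, y_demo)
rules = export_text(small_tree, feature_names=names)
print(rules)
```

Running `export_text(estimator_2, feature_names=feature_names)` on the actual pre-pruned model would yield human-readable targeting rules to accompany the recommendations below.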
The Bank should target customers with higher incomes: The analysis showed that customers with higher income levels are more likely to accept personal loans. Marketing efforts could be focused more on higher-income individuals and families.
The Bank should target customers with higher CCAvg: Similarly, customers with higher average monthly credit-card spending (CCAvg) demonstrated a higher likelihood of taking personal loans. These customers are likely more comfortable with debt management.
Consider the Customer's Education Level: Based on the modeling data, customers with a higher education level are more likely, or more easily influenced, to take out a loan. Whether education brings comfort with borrowing or education itself requires loans, this is a great area to target.
Utilize Customer's Banking Behaviors: Customers who already have a securities account / CD account with the bank show higher conversion rates to loans. These services indicate a customer's trust and relationship with the bank, which could be utilized as a promotional offering for personal loans
Consider Age and Experience: While age and experience alone didn't show a clear advantage for personal loan acceptance, combining these with other factors like income and education could help in segmenting customers better.
Consider developing marketing campaigns that focus on the customers identified through the model. Further clustering analysis may help segment these customers better; in the meantime, targeting on income, education, and CCAvg (including customers still in school) looks promising. Utilize variables such as age and other factors (potentially credit history) to help establish more detailed guidelines.
Begin collecting more data on other factors that affect loan acquisition, such as credit history (if available), interest rates compared to loan acceptance, and perhaps compounding-interest differences compared to new accounts.
In addition, perhaps there are opportunities for refinancing of current debt or consolidating debt for new graduate or other segments.
Tailor personal loan offers based on the customer's financial profile. For example, offer competitive interest rates or flexible repayment terms for high-income customers or those with a good track record of credit card usage.
Continuously monitor the performance of the personal loan campaigns and the predictive models. Adapt strategies as the overall economy changes. Economic changes likely have a major impact on loan acquisition.